AD-A159 737


UNCLASSIFIED 


ASCAL: A MICROCOMPUTER PROGRAM FOR ESTIMATING LOGISTIC
IRT (ITEM RESPONSE.. (U) ASSESSMENT SYSTEMS CORP ST PAUL
MN C D VALE ET AL. 11 NOV 85 85-4-ONR N00014-83-C-0634


F/Q  9/2 


1/1 

NL 


Unclassified


SECURITY CLASSIFICATION OF THIS PAGE


1a. REPORT SECURITY CLASSIFICATION
Unclassified


SECURITY  CLASSIFICATION  AUTHORITY 


DECLASSIFICATION  /  DOWNGRADING  SCHEDULE 


4. PERFORMING ORGANIZATION REPORT NUMBER(S)


REPORT  DOCUMENTATION  PAGE 


ONR-85-4 


6a. NAME OF PERFORMING ORGANIZATION
Assessment Systems Corporation


6b.  OFFICE  SYMBOL 
(If  applicable) 


6c. ADDRESS (City, State, and ZIP Code)

233 University Avenue, Suite 310
St. Paul, MN 55114


8a. NAME OF FUNDING/SPONSORING
ORGANIZATION

Personnel and Training Research


8b.  OFFICE  SYMBOL 
(If  applicable) 


1b. RESTRICTIVE MARKINGS


3.  DISTRIBUTION  /AVAILABILITY  OF  REPORT 
approved  for  public  release; 
distribution  unlimited 


5.  MONITORING  ORGANIZATION  REPORT  NUMBER(S) 


7a.  NAME  OF  MONITORING  ORGANIZATION 
Office  of  Naval  Research 


7b.  ADDRESS  (City,  State,  and  ZIP  Code) 

800  N.  Quincy  Street,  Code  442 
Arlington,  VA  22217 


9.  PROCUREMENT  INSTRUMENT  IDENTIFICATION  NUMBER 
N00014-83-C-0634 


10  SOURCE  OF  FUNDING  NUMBERS 


PROGRAM 
ELEMENT  NO. 


PROJECT 

NO. 

N  507-002 


8c. ADDRESS (City, State, and ZIP Code)

Office of Naval Research
800 N. Quincy Street, Code 442
Arlington, VA 22217


11. TITLE (Include Security Classification)

ASCAL: A Microcomputer Program for Estimating Logistic IRT Item Parameters


12. PERSONAL AUTHOR(S)
C. David Vale & Kathleen A. Gialluca


WORK  UNIT 
ACCESSION  NO 


13a. TYPE OF REPORT
Technical report


13b. TIME COVERED  14. DATE OF REPORT (Year, Month, Day)  15. PAGE COUNT

FROM 9/1/83 TO 11/11/85  85 Nov 11  17


COSATI  CODES 


SUB-GROUP 


18  SUBJECT  TERMS  (Continue  on  reverse  if  necessary  and  identify  by  block  number) 
adaptive  testing,  ASCAL,  item  analysis,  item  banking,  item 
calibration,  IRT,  item  response  theory,  LOGIST,  MicroCAT, 
parameter  estimation,  tailored  testing,  test  analysis 


19. ABSTRACT (Continue on reverse if necessary and identify by block number)


ASCAL is a microcomputer-based program for calibrating items according to the
three-parameter logistic model of item response theory. ASCAL employs Lord's
(1974) modified likelihood equations (for items that are omitted or not reached)
and Bayesian prior distributions on the discrimination, guessing, and ability
parameters to arrive at final estimates of the item parameters. No ability
parameters are produced.

ASCAL uses a modified multivariate Newton-Raphson procedure for estimating
item parameters. The estimation process begins by specifying starting points for
the ability and item parameters. In this procedure, abilities are first
estimated using the preliminary estimates of the item parameters. After ability
estimates have been obtained for all examinees, they are sorted and grouped into

20. DISTRIBUTION/AVAILABILITY OF ABSTRACT
☒ UNCLASSIFIED/UNLIMITED  □ SAME AS RPT.  □ DTIC USERS


21. ABSTRACT SECURITY CLASSIFICATION
Unclassified


22a. NAME OF RESPONSIBLE INDIVIDUAL
Charles E. Davis


22b. TELEPHONE (Include Area Code)  22c. OFFICE SYMBOL

202-696-4046


DD FORM 1473, 84 MAR


83 APR edition may be used until exhausted.
All other editions are obsolete.


SECURITY CLASSIFICATION OF THIS PAGE
Unclassified


Block 19. (Continued)


20  fractiles  with  approximately  equal  numbers  of  examinees  in  each.  The  item 
parameters  are  then  estimated  by  assuming  that  the  20  fractile  means  are 
representative  of  the  entire  ability  distribution.  The  sequence  of  ability 
estimation,  ability  grouping,  and  parameter  estimation  is  repeated  until  the  item 
parameters  converge  on  stable  values  or  fail  to  improve. 

This  procedure  was  evaluated  using  Monte  Carlo  simulation  techniques.  The 
current  version  of  ASCAL  was  then  compared  to  the  current  version  of  LOGIST 
(Wingersky,  Barton,  &  Lord,  1982)  and  a  prior  version  of  ASCAL  that  was  used  to 
estimate  the  parameters  in  the  adaptive  item  pool  of  the  Armed  Services 
Vocational  Aptitude  Battery.  Parameters  were  estimated  for  the  items  in  each  of 
three  different  simulated  tests.  Three  different  evaluative  criteria  were  used: 
(1)  root  mean  squared  error  between  true  and  estimated  item  parameters,  (2)  the 
correlation  between  true  and  estimated  item  parameters,  and  (3)  calibration 
efficiency.  This  last  criterion  measures  the  amount  of  information  in  the 
estimated  parameters  relative  to  the  information  in  the  true  parameters  and  is  an 
indication  of  the  joint  effects  of  calibration  error  in  the  individual 
parameters.

The  results  of  this  evaluation  suggest  that  ASCAL  produces  parameter 
estimates  that  are  at  least  as  accurate  as  those  produced  by  LOGIST.  There  were 
only  minor  (and  inconsequential)  differences  in  the  three  evaluative  criteria 
between  the  two  versions  of  ASCAL.  The  differences  between  ASCAL  and  LOGIST  were 
slightly  larger  than  the  differences  between  the  two  versions  of  ASCAL;  however, 
even  these  differences  were  small.  Item  parameter  error  increased  markedly  (for 
each  of  the  three  parameters)  according  to  all  three  criteria  and  for  all 
calibration  procedures  when  items  with  an  exaggerated  range  of  difficulty 
parameters  were  calibrated.  Only  the  difficulty  parameter  correlations  were 
unaffected  by  this  manipulation. 

INTRODUCTION


The MicroCAT™ Testing System is a self-contained system of programs for
developing, administering, and evaluating psychological tests using both
adaptive and conventional strategies (Assessment Systems Corporation, 1984).
The system runs on IBM personal computers, either individually or in a local
area network. MicroCAT™ uses item response theory (IRT; Lord & Novick,
1968) to calibrate adaptive test items. The program within the MicroCAT™
system that performs IRT item calibration is called ASCAL.

ASCAL  was  the  first  IRT  calibration  program  for  microcomputers.  The 
most  popular  IRT  calibration  program,  LOGIST  (Wingersky,  Barton,  &  Lord, 
1982),  was  available  only  for  mainframe-class  machines  and  was  typically  quite 
expensive  to  run.  It  was  not  unusual  for  the  cost  of  calibrating  the  items  in  a 
single  test  to  exceed  $50  for  computer  time  alone. 

Three-Parameter  Logistic  IRT  Model 

Both  LOGIST  and  ASCAL  estimate  parameters  for  the  three-parameter 
logistic  IRT  model,  which  was  developed  for  use  with  dichotomously  scored 
multiple-choice  items.  The  item  response  function  in  this  model  expresses  the 
probability of a correct response as a function of examinee ability, theta (θ),
and three item parameters, a, b, and c. The item response function is an
S-shaped curve going from a lower asymptote equal to the c parameter (the
pseudo-guessing  parameter)  to  a  maximum  value  of  1.0.  The  midpoint  of  this 
curve  has  its  projection  onto  the  ability  scale  at  b  (the  item  difficulty 
parameter).  The  slope  or  rate  at  which  the  probability  increases  as  a  function 
of  ability  is  a  function  of  the  a  parameter  (the  discrimination  parameter).  The 
three-parameter  IRT  model  is  specified  in  Equation  1. 
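
As a rough illustration of the item response function described above, the following Python sketch evaluates the standard three-parameter logistic form. The function name `p3pl` and the scaling constant `D = 1.7` are assumptions made for illustration; the text above does not state which scaling constant is used.

```python
import math

D = 1.7  # assumed logistic scaling constant; not stated in the text above


def p3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model.

    theta: examinee ability
    a: discrimination, b: difficulty, c: pseudo-guessing (lower asymptote)
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# At theta == b the curve sits halfway between the lower asymptote c and 1.0.
halfway = p3pl(0.0, 1.0, 0.0, 0.2)
```

For a very low ability the function approaches c, and for a very high ability it approaches 1.0, matching the description of the curve's asymptotes.
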

LOGIST’s method for estimating item parameters is based on an adaptation,
proposed by Lord (1974), of the maximum likelihood principle. This adaptation
makes allowances for items that are unanswered or omitted by an examinee.
When all examinees respond to all items in the test, Lord’s procedure is
equivalent to the classical maximum likelihood procedure.

Lord  made  a  distinction  between  items  that  were  omitted  and  those  that 
were  not  reached.  Unanswered  items  embedded  within  the  sequence  of  items  to 
which  the  examinee  did  respond  were  considered  omitted  items.  That  is,  it  was 
assumed  that  the  examinee  had  the  opportunity  to  respond  to  the  items  but 
chose,  for  some  reason,  to  skip  or  omit  those  items.  On  the  other  hand, 
unanswered  items  at  the  end  of  a  test  were  considered  not  reached.  In  this 
case,  it  was  assumed  that  the  examinee  did  not  have  enough  time  to  respond  to 
all  of  the  items  in  the  test.  Similarly,  items  that  are  never  presented  to  some 
subset  of  examinees  can  be  considered  not  reached  for  those  examinees. 

In  Lord’s  scheme,  unreached  items  are  not  used  to  estimate  an  examinee’s 
ability;  the  scored  response  vector  does  not  include  the  block  of  unanswered 
items  that  occurs  at  the  end  of  a  test.  To  score  unreached  items  as  incorrect 
would mean that an examinee’s score is dependent, at least partially, on the
speed  with  which  he  or  she  answered  the  items  on  the  test.  This  is  a  major 
violation  of  the  assumptions  of  IRT,  which  models  the  probability  of  a  correct 
answer  on  item  characteristics  and  the  examinee’s  ability  only. 

On  the  other  hand,  examinees  are  given  partial  credit  for  those  items  that 
they  chose  to  omit.  The  rationale  for  this  is  twofold:  (1)  it  is  assumed  that  the 
examinees  were  administered  these  items  and  had  ample  time  to  respond  to 
them,  and  (2)  they  would  have  answered  some  of  these  items  correctly  if  they 
had  guessed  rather  than  skipped  the  item. 

Bayesian  Estimation  of  Ability  and  Item  Parameters 

Maximum  likelihood  procedures  for  estimating  item  parameters  have 
several  desirable  characteristics  (for  example,  under  certain  conditions,  they 
produce estimates that are asymptotically efficient and asymptotically normally
distributed).  In  practice,  however,  such  procedures  are  often  difficult  to 
implement.  For  example,  no  maximum  likelihood  ability  estimate  can  be 
obtained  for  an  examinee  if  he  or  she  answers  all  of  the  items  correctly  or 
answers  fewer  items  correctly  than  would  be  expected  by  chance.  Furthermore, 
for  complicated  estimation  situations  (such  as  item  calibration),  numerical 
compromises  must  often  be  made;  a  number  of  ad  hoc  decision  rules  and 
auxiliary  estimation  procedures  were  incorporated  into  LOGIST  during  its 
development.  For  example,  estimates  of  the  c  parameter  cannot  be  obtained  by 
LOGIST  for  items  that  are  too  easy.  This  is  because  typical  data  sets  do  not 
provide  sufficient  information  to  allow  such  parameters  to  be  estimated  by  the 
maximum  likelihood  method  with  any  accuracy.  Thus,  LOGIST  will  set  the  c 
parameters  of  such  items  to  the  average  c  parameter  of  the  remaining  items. 

The  calibration  procedures  of  ASCAL  were  modeled  after  those  in  LOGIST. 
However,  ASCAL  makes  a  different  set  of  numerical  compromises  that  allow, 
in effect, a conceptually simpler approach to item calibration. That is, ASCAL
adds Bayesian prior distributions to the pseudo-likelihood functions for the
ability  estimates  and  the  a  and  c  parameters  (no  prior  distribution  is  used  for 
the  b  parameter).  Thus,  the  pseudo-maximum  likelihood  estimation  process  of 
LOGIST  is  a  pseudo-Bayesian  estimation  process  in  ASCAL.  The  prior 
distribution  of  ability  chosen  for  ASCAL  is  a  standard  normal  distribution.  By 
convention,  all  three-parameter  IRT  calibration  programs  set  the  scale  of 
ability  to  have  a  mean  of  zero  and  a  variance  of  one;  the  normal  shape  of  the 
prior  distribution  used  here  was  chosen  as  the  most  representative  general  form 
for  a  distribution  of  ability. 

Symmetric  beta  distributions  were  used  as  the  specified  Bayesian  priors  for 
the  a  and  c  parameters.  These  distributions  were  chosen  because  they  provided 
a  continuous  bounding  mechanism  consistent  with  intuition  concerning  probable 
values  of  the  a  and  c  parameters.  The  prior  distribution  for  the  a  parameters 
was specified as B(3.0, 3.0) with upper and lower bounds of 2.60 and 0.30,
respectively. The prior distribution for the c parameters was specified as
B(5.0, 5.0) with a lower bound of -0.05 and an upper bound equal to 0.05 plus
twice the reciprocal of the number of alternatives. The classical beta density
functions were redefined (using a simple change-of-variable transformation) to
have the bounds specified above instead of upper and lower bounds of one and
zero, respectively (see below). These transformations ensured that the a and c
parameters were in the appropriate range. The estimates of the b parameters
were bounded by ±3.00.
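
The change-of-variable transformation mentioned above can be sketched as follows. This is a minimal illustration, not ASCAL's code: the function names are hypothetical, and the log density is simply the classical beta density applied to the rescaled variable, with the Jacobian term for the new interval width.

```python
import math


def scaled_beta_logpdf(x, alpha, beta, lower, upper):
    """Log density of a Beta(alpha, beta) variable rescaled to (lower, upper)."""
    if not lower < x < upper:
        return float("-inf")
    z = (x - lower) / (upper - lower)  # map x onto the unit interval
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    # The change of variable contributes -log(upper - lower) to the log density.
    return (log_norm
            + (alpha - 1.0) * math.log(z)
            + (beta - 1.0) * math.log(1.0 - z)
            - math.log(upper - lower))


def a_log_prior(a):
    """Prior on discrimination: B(3.0, 3.0) on (0.30, 2.60), as given in the text."""
    return scaled_beta_logpdf(a, 3.0, 3.0, 0.30, 2.60)


def c_log_prior(c, n_alternatives):
    """Prior on guessing: B(5.0, 5.0) on (-0.05, 0.05 + 2 / n_alternatives)."""
    return scaled_beta_logpdf(c, 5.0, 5.0, -0.05, 0.05 + 2.0 / n_alternatives)
```

Because both priors are symmetric, each is centered on the midpoint of its interval. For the c prior that midpoint works out to the reciprocal of the number of alternatives (0.25 for a four-alternative item).
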

The  criterion  function  maximized  by  this  pseudo-Bayesian  method  is  shown 
in  Equation  2.  This  is  the  pseudo-likelihood  function  proposed  by  Lord  (1974; 
Wood,  Wingersky,  &  Lord,  1976,  Equation  4)  weighted  by  a  univariate  normal 
distribution  on  theta  and  univariate  beta  distributions  on  a  and  c. 

Simultaneous solution for the roots of the derivative equations for all
parameters  is  numerically  intractable.  Therefore,  ASCAL  uses  the  general 
iterative  procedure  followed  by  LOGIST.  In  this  procedure,  the  ability 
parameters  and  the  item  parameters  are  estimated  separately.  First,  the 
abilities  are  estimated  while  the  item  parameters  are  assumed  to  be  known;  then 
the  item  parameters  are  estimated  while  the  abilities  are  assumed  to  be  known. 
This  process  is  repeated,  and  the  ability  and  item  parameter  estimates  are 
updated  and  refined  at  each  stage.  While  there  is  no  guarantee  that  this  process 
will  converge  on  final  ability  and  item  parameter  estimates,  the  experience 
with  LOGIST  and  ASCAL  has  been  that  acceptable  estimates  are  produced. 


NUMERICAL  IMPLEMENTATION 
Overview 


ASCAL  uses  a  modified  multivariate  Newton-Raphson  procedure  for 
estimating  the  parameters.  The  estimation  process  begins  by  specifying  starting 
points  for  the  ability  and  item  parameters.  Once  these  starting  points  have 
been  obtained,  the  modified  Newton-Raphson  procedure  begins.  In  this 
modified  procedure,  abilities  are  first  estimated  using  the  preliminary  estimates 
of  the  item  parameters. 

After ability estimates have been obtained for all examinees, they are
sorted and grouped into 20 fractiles with approximately equal numbers of
examinees in each; this grouping is done for computational convenience. After
grouping, the fractile means are weighted by the number of subjects in each
fractile and then standardized. The mean ability for each fractile is taken as
representative of all examinees contained in that fractile. The item parameters
are then estimated by assuming the 20 ability levels to be representative of the
entire ability distribution.

The  sequence  of  ability  estimation,  ability  grouping,  and  parameter 
estimation  is  repeated  several  times  until  the  item  parameters  converge  on 
stable  values  or  fail  to  improve.  The  details  of  each  of  these  processes  are 
described  below. 

Initial  Estimates 

A  Newton-Raphson  iteration  requires  that  initial  estimates  be  specified  for 
all  of  the  parameters  that  are  to  be  estimated.  These  estimates  must  be 
reasonably  accurate  or  the  iteration  process  may  diverge  and  fail  to  produce 
acceptable  estimates. 

The  initial  ability  estimates  are  obtained  from  raw  formula  scores.  These 
scores  are  computed  as  the  number  of  items  answered  correctly  minus  a 
fraction  of  the  items  answered  incorrectly.  For  each  incorrect  item,  the 
fraction  subtracted  is  equal  to  the  reciprocal  of  one  less  than  the  number  of 
alternatives.  These  formula  scores  are  then  standardized  using  a  linear 
transformation  based  on  the  mean  and  standard  deviation  of  the  formula  scores 
in  the  group  of  examinees. 
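
The formula-score starting values described above can be sketched in a few lines of Python. The function names are hypothetical, the sketch assumes every item has the same number of alternatives, and the use of the population standard deviation is an assumption (the text does not say which form is used).

```python
import statistics


def formula_score(n_correct, n_incorrect, n_alternatives):
    """Number right minus 1 / (alternatives - 1) for each item answered incorrectly."""
    return n_correct - n_incorrect / (n_alternatives - 1)


def standardize(scores):
    """Linear transformation of the scores to mean 0 and standard deviation 1."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population SD; an assumption
    return [(s - mean) / sd for s in scores]


# Example: 30 right and 12 wrong on four-alternative items gives 30 - 12/3.
example = formula_score(30, 12, 4)
```
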

The  initial  item  parameters  are  obtained  through  heuristic  transformations 
of  the  classical  item  statistics  as  described  by  Jensema  (1976).  The  c  parameters 
are  estimated  as  the  reciprocal  of  the  number  of  alternatives  for  each  item. 

The  initial  estimates  of  the  a  and  b  parameters  are  then  computed  from  the 
corrected  biserial  item-total  correlation  and  the  proportion  of  examinees 
answering  the  item  correctly. 

In  the  first  of  the  iterations  that  follow,  the  initial  item-parameter 
estimates  are  used  to  obtain  the  first  estimates  of  ability.  The  initial  ability 
parameters  obtained  by  standardizing  the  formula  scores  are  used  only  as 
starting  points  for  the  Newton-Raphson  iteration. 

Ability  Estimation 

Ability  estimates  are  obtained  through  Newton-Raphson  iteration  on  the 
derivative  with  respect  to  theta  of  the  criterion  function  L*  shown  in  Equation 
2.  Using  both  the  first  and  second  derivatives  of  the  criterion  function,  the 
iteration  proceeds  for  each  examinee  individually  until  an  ability  value  is 
found  for  that  examinee  that  results  in  a  root  for  the  first  derivative  of  the 
criterion  function.  The  estimate  thus  obtained  is  very  similar  to  a  Bayesian 
modal  estimator  of  ability.  It  differs  from  a  true  Bayesian  modal  estimator  in 
that  this  estimator  incorporates  Lord’s  pseudo-likelihood  function  rather  than  a 
true  likelihood  function. 
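
A simplified sketch of this kind of Newton-Raphson ability estimation is shown below. It is not ASCAL's implementation: it uses the ordinary three-parameter likelihood rather than Lord's pseudo-likelihood (so omitted and unreached items are not handled), it assumes a scaling constant of 1.7, and it approximates the second derivative by finite differences instead of an analytic expression. All function names, the step limit, and the tolerance are illustrative assumptions.

```python
import math

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def d_log_posterior(theta, items, responses):
    """First derivative of the log-likelihood plus the log of a N(0, 1) prior."""
    total = -theta  # derivative of the standard normal log prior
    for (a, b, c), u in zip(items, responses):
        p = p3pl(theta, a, b, c)
        total += D * a * (p - c) / (1.0 - c) * (u - p) / p
    return total


def estimate_ability(items, responses, theta=0.0, tol=1e-6, max_iter=50, h=1e-4):
    """Iterate until the first derivative has a root (a posterior mode)."""
    for _ in range(max_iter):
        g = d_log_posterior(theta, items, responses)
        # Central-difference approximation to the second derivative.
        g2 = (d_log_posterior(theta + h, items, responses)
              - d_log_posterior(theta - h, items, responses)) / (2.0 * h)
        if g2 < 0.0:
            step = -g / g2                    # ordinary Newton-Raphson step
        else:
            step = 0.5 if g > 0.0 else -0.5   # not concave here: just move uphill
        step = max(-1.0, min(1.0, step))      # illustrative step limit
        theta += step
        if abs(step) < tol:
            break
    return theta
```

Because of the prior term, the sketch returns a finite estimate even for an all-correct response vector, which is the situation in which a pure maximum likelihood estimate does not exist.
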


Ability  Grouping 

When the item parameters are estimated, the criterion function must be
summed over all ability levels. This could be done for each examinee
individually or the examinees could be grouped into fractiles and the sum
could then be taken over these groups. The latter procedure results in
considerable computational savings, and is the procedure employed by ASCAL,
using 20 fractiles. There is precedent for grouping abilities in this way; such a
procedure is used both in LOGIST and in a related calibration program,
LOGOG (Kolakowski and Bock, 1973). LOGIST documentation recommends
using grouped abilities only in some stages of estimation. Like LOGOG,
ASCAL uses grouped abilities for all stages of estimation.

The  ability-grouping  process  begins  by  sorting  abilities  from  low  to  high. 
ASCAL  then  attempts  to  group  these  abilities  into  20  fractiles,  each  containing 
an  equal  number  of  examinees.  To  accomplish  this,  the  ability  level  for  the 
examinee  at  each  20th  of  the  distribution  is  obtained  as  a  bound.  Examinees 
are  then  grouped  into  the  fractiles  with  a  bound  most  immediately  above  their 
ability  levels.  This  process  results  in  equal-sized  groups  except  when  examinees 
with  equal  abilities  span  a  fractile  boundary  or  when  the  number  of  examinees 
is  not  evenly  divisible  by  20.  Typically,  neither  of  these  conditions  results  in 
any  substantial  differences  in  fractile  sizes.  Fractile  divisions  are  done  on  a 
running  total  of  sample  fractions  so  that  round-off  error  does  not  result  in  an 
excessive  number  of  subjects  in  the  last  fractile. 

The  mean  ability  is  then  computed  in  each  fractile  and  taken  as  the 
representative  ability  for  all  examinees  in  that  fractile.  Whereas  the  median 
ability  in  the  fractile  might  be  more  appropriate  for  a  procedure  such  as 
LOGOG  or  LOGIST  which  uses  maximum  likelihood  estimates  of  ability  (with 
the  potential  of  infinite  ability  estimates  on  finite  length  tests)  the  mean 
ability  is  an  appropriate  characterization  for  ASCAL  in  which  the  abilities  are 
obtained  by  a  Bayesian  method. 

The  20  fractile  means  represent  all  ability  levels  in  the  group  of  examinees. 
These  20  ability  levels  are  then  standardized  so  that  they  have  a  mean  of  zero 
and  a  standard  deviation  of  one  (as  was  assumed  for  the  ability  distribution). 
ASCAL  performs  the  standardization  by  weighting  each  fractile  mean  by  the 
number  of  examinees  in  the  fractile.  If  the  grouping  procedure  produced 
exactly  equal  numbers  of  examinees  in  each  of  the  fractiles,  this  weighting 
would  be  unnecessary.  The  weighting  is  done  to  preclude  complications  in 
cases  where  the  numbers  of  examinees  are  not  exactly  equal. 
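
The grouping and weighted standardization described in this section can be sketched as follows. This is an illustrative simplification with hypothetical function names: fractile boundaries are taken from a running total of sample fractions, as the text describes, but tied abilities that span a boundary are not kept together.

```python
def group_into_fractiles(abilities, n_fractiles=20):
    """Sort abilities and split them into near-equal groups.

    Returns a list of (fractile mean, number of examinees) pairs.
    """
    ordered = sorted(abilities)
    n = len(ordered)
    groups = []
    start = 0
    for k in range(1, n_fractiles + 1):
        # Running total of sample fractions, so rounding error does not pile
        # up in the last fractile.
        end = round(k * n / n_fractiles)
        if end > start:
            chunk = ordered[start:end]
            groups.append((sum(chunk) / len(chunk), len(chunk)))
        start = end
    return groups


def standardize_weighted(groups):
    """Rescale fractile means to weighted mean 0 and weighted SD 1."""
    total = sum(count for _, count in groups)
    mean = sum(m * count for m, count in groups) / total
    var = sum(count * (m - mean) ** 2 for m, count in groups) / total
    sd = var ** 0.5
    return [((m - mean) / sd, count) for m, count in groups]
```

With 103 examinees, for example, the sketch produces 20 groups of five or six examinees each, and the weighting keeps the standardized means centered correctly despite the unequal group sizes.
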

Parameter  Estimation 

The  item  parameters  are  estimated  for  one  item  at  a  time.  The  a  and  b 
parameters are estimated using a Newton-Raphson procedure. The c parameters
are  estimated  by  systematically  stepping  through  possible  values.  Thus  the  a 
and  b  estimation  procedure  is  performed  within  a  loop  that  steps  through  all 
reasonable  values  of  the  c  parameter. 

For  computational  efficiency,  the  c  parameter  stepping  process  is  done  in 
two  stages.  In  the  first  stage,  trial  values  of  c  are  evaluated  starting  at  the  c 
estimate  from  the  previous  loop  and  stepping  outward  in  steps  of  0.05.  The 
process  first  steps  down  until  a  limit  is  reached  or  the  criterion  function 
decreases. Then, if no increase was observed while stepping down, the stepping
process steps up until the criterion function decreases or a limit is reached. A
second stage of iteration is then performed starting at the best c value obtained
in the first stage and proceeding outward (in steps of 0.01) to a maximum
distance of 0.04 from the second-stage starting point or until the criterion
function fails to increase.
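
The two-stage stepping search can be sketched as below. Here `criterion` is a hypothetical callable standing in for the criterion function evaluated at a trial c (in the procedure described, evaluating it would involve re-estimating a and b at that c). The function names are illustrative, and applying the same down-then-up order in the second stage is an assumption; the text only says the second stage proceeds outward.

```python
def climb(criterion, start, step, lower, upper, max_steps=None):
    """Step from `start` in one direction while the criterion keeps increasing."""
    best_c, best_val = start, criterion(start)
    taken = 0
    while max_steps is None or taken < max_steps:
        trial = round(best_c + step, 10)  # avoid accumulating float drift
        if trial < lower or trial > upper:
            break  # a limit was reached
        val = criterion(trial)
        if val <= best_val:
            break  # the criterion failed to increase
        best_c, best_val = trial, val
        taken += 1
    return best_c, best_val


def search_down_then_up(criterion, start, step, lower, upper, max_steps=None):
    """Step down first; step up only if stepping down gave no increase."""
    best_c, best_val = climb(criterion, start, -step, lower, upper, max_steps)
    if best_c == start:
        best_c, best_val = climb(criterion, start, step, lower, upper, max_steps)
    return best_c, best_val


def estimate_c(criterion, c_previous, lower, upper):
    """Coarse search in steps of 0.05, then a fine search in steps of 0.01."""
    coarse_c, _ = search_down_then_up(criterion, c_previous, 0.05, lower, upper)
    # The fine stage moves at most 0.04 (four steps) from its starting point.
    fine_c, _ = search_down_then_up(criterion, coarse_c, 0.01, lower, upper, max_steps=4)
    return fine_c
```

The coarse stage keeps the number of criterion evaluations small, while the fine stage recovers the resolution of 0.01 around the best coarse value.
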

At each trial value of c, the a and b parameters are estimated. These
parameters are estimated jointly using a bivariate Newton-Raphson procedure
if the matrix of partial second derivatives is positive definite. Otherwise, the a
and the b parameters are estimated independently. To prevent wild
fluctuations in the a and b parameters during the estimation process, the
maximum adjustment to the a parameters is limited to plus or minus 0.2 and the
maximum adjustment to the b parameters is limited to plus or minus 0.4. The a
and b parameters are considered to have converged for a trial value of c if the
sum of the absolute change in a and the absolute change in b is less than 0.01.

If  the  a  and  b  parameters  have  failed  to  converge  after  10  iterations,  the 
change  indicated  for  each  parameter  by  the  Newton-Raphson  procedure  is 
reduced  by  a  power  of  0.8  for  each  additional  iteration.  Thus,  the  prescribed 
change  is  dampened  or  multiplied  by  0.8  on  the  eleventh  iteration,  by  0.64  on 
the  twelfth  iteration,  etc. 
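
The step limits, convergence check, and dampening schedule described above can be sketched as follows. The function names are hypothetical, and applying the dampening factor before the step limit is an assumption; the text does not specify the order.

```python
def limited_step(delta, max_abs):
    """Clamp a proposed adjustment to the interval [-max_abs, +max_abs]."""
    return max(-max_abs, min(max_abs, delta))


def damping_factor(iteration, start_after=10, rate=0.8):
    """1.0 through the tenth iteration, then 0.8, 0.64, and so on."""
    if iteration <= start_after:
        return 1.0
    return rate ** (iteration - start_after)


def adjust(a, b, delta_a, delta_b, iteration):
    """Apply one dampened, limited Newton-Raphson adjustment to a and b.

    delta_a and delta_b are the changes prescribed by the Newton-Raphson step.
    Returns the new a, the new b, and whether the pair has converged.
    """
    factor = damping_factor(iteration)
    new_a = a + limited_step(factor * delta_a, 0.2)
    new_b = b + limited_step(factor * delta_b, 0.4)
    converged = abs(new_a - a) + abs(new_b - b) < 0.01
    return new_a, new_b, converged
```
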


Program  Termination 

ASCAL  terminates  its  iterative  process  either  when  the  parameter  estimates 
converge  on  constant  values  or  when  the  maximum  number  of  iterations 
allowed  by  the  user  is  reached.  The  item  parameters  are  assumed  to  have 
converged  if  the  sum  of  the  absolute  changes  in  a,  b,  and  c  from  one  iteration 
to  the  next  is  less  than  0.01  for  two  consecutive  iterations  on  every  item. 
Alternatively,  if  the  total  number  of  loops  reaches  the  maximum  number 
allowed  by  the  user,  the  program  terminates. 
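
The termination rule can be expressed compactly. In this sketch, which uses hypothetical names, `history` is a list of per-loop parameter sets (oldest first, with the initial estimates as the first entry), and each parameter set is a list of `(a, b, c)` tuples, one per item.

```python
def converged_this_loop(previous, current, tol=0.01):
    """True if |change in a| + |change in b| + |change in c| < tol for every item."""
    return all(
        sum(abs(new - old) for old, new in zip(before, after)) < tol
        for before, after in zip(previous, current)
    )


def should_stop(history, max_loops, tol=0.01):
    """Stop at the loop limit, or after two consecutive converged loops."""
    loops_done = len(history) - 1  # the first entry holds the initial estimates
    if loops_done >= max_loops:
        return True
    if len(history) < 3:
        return False
    return (converged_this_loop(history[-3], history[-2], tol)
            and converged_this_loop(history[-2], history[-1], tol))
```
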

Program Operation

The  details  of  program  operation  are  described  in  Chapter  12  of  the  User’s 
Manual  for  the  MicroCAT  Testing  System  (Assessment  Systems  Corporation,  1984). 
ASCAL  was  designed  to  estimate  item  parameters  only  and  was  not  intended  to 
be  a  scoring  program;  thus,  no  ability  estimates  are  provided  for  the  examinees. 
Because  no  editing  capabilities  are  included  in  the  program,  it  is  assumed  that 
all  data  are  input  to  ASCAL  in  correct  and  final  form. 

Input.  The  input  data  file  contains  raw  item  responses  for  each  examinee. 
This  data  file  must  also  contain  the  item-specific  information  required  for  item 
calibration;  this  additional  information  is  included  in  the  data  file  and  not  in  a 
separate  program-control  file.  In  addition  to  the  raw  item  responses,  then,  the 
input  data  file  contains  the  following  information:  (1)  number  of  items,  (2) 
symbol  codes  for  omitted  and  unreached  items,  (3)  rudimentary  formatting 
information  indicating  how  many  columns  of  the  examinee  response  record 
contain  identification  data,  (4)  the  number  of  response  alternatives  for  each 
item,  (5)  the  keyed  response  for  each  item,  and  (6)  a  flag  for  each  item 
indicating  whether  or  not  it  should  be  included  in  the  analysis. 

The interactive operation of ASCAL is quite simple. The user needs only
to  provide  the  name  of  the  input  file,  the  name  of  the  output  file,  and  the 
maximum  number  of  loops  through  which  the  program  should  proceed,  if  it  has 
not  converged  prior  to  that  point.  Ten  iterations  usually  provide  adequate 
estimates. 

Output.  ASCAL  output  is  a  simple  listing  of  the  following  information:  (1) 
the  user-specified  program-control  parameters,  (2)  the  heuristic  item  parameters 
used  as  initial  starting  values,  (3)  parameter  changes  at  each  stage  through  the 
parameter-estimation  process,  and  (4)  the  final  parameters  and  their  Pearson 
chi-square  lack-of-fit  test  results. 

System  requirements.  ASCAL  runs  on  an  IBM  PC,  PC  XT,  or  PC  AT  and 
many  of  the  IBM  PC-compatible  computers.  Memory  requirements  for  ASCAL 
are somewhere between 128K and 192K. ASCAL will run on systems either
with  or  without  the  math  coprocessor  chip;  however,  the  run  time  without  this 
chip  can  be  excessive.  For  example,  for  a  50-item  test  administered  to  2,000 
examinees,  the  typical  time  with  the  coprocessor  chip  is  approximately  2  hours; 
without  the  coprocessor  chip,  the  same  run  may  take  more  than  24  hours.  In 
either  case,  the  results  will  be  identical. 


PROGRAM  EVALUATION 
Method 

ASCAL  was  evaluated  using  a  Monte  Carlo  simulation  (see  Ree,  1978,  or 
Vale  &  Weiss,  1975,  for  a  full  description  of  a  simulation).  In  this  study,  scored 
responses  to  items  with  known  ("true")  a,  b,  and  c  parameters  were  generated 
using  the  three-parameter  logistic  IRT  model.  ASCAL  was  then  used  to 
estimate  item  parameters  from  these  simulated  item  responses.  The  estimated 
item  parameters  were  compared  with  the  true  item  parameters  using  several 
different  criteria.  This  simulation  procedure  is  discussed  in  more  detail  below. 

Simulated  item  responses.  Three  sets  of  true  parameters  were  used.  The 
first  set  of  parameters  was  obtained  from  a  25-item  test  containing  general 
science  items.  To  obtain  an  effective  test  length  of  50,  each  item  and  its 
parameters  were  included  in  the  test  twice.  All  items  in  Test  1  had  four 
alternatives.  Test  1  had  a  restricted  range  of  item  difficulty,  typical  of  a 
conventional  test,  but  inappropriate  for  adaptive  testing.  To  create  a  test  more 
appropriate  for  adaptive  testing,  Test  2  was  created  by  multiplying  all  of  the  b 
parameters  in  Test  1  by  2.0.  Thus  Tests  1  and  2  were  identical  except  for  the  b 
parameters.  Test  3  was  modeled  after  a  57-item  test  of  shop  knowledge.  Each 
item  in  Test  3  had  five  alternatives.  Since  Test  3  was  developed  as  part  of  an 
adaptive  item  pool,  it  had  a  slightly  wider  range  of  difficulty  than  did  Test  1, 
although  its  difficulty  range  was  not  as  wide  as  that  of  Test  2.  Table  1 
presents  the  means  and  standard  deviations  of  the  true  parameters  used  as 
models  for  the  simulations. 

Item response data were generated for 2,000 examinees for each of the
three  tests.  Examinee  ability  levels  were  sampled  from  a  standard  normal 
distribution. 
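
A data-generation step of this kind can be sketched as below. This is an illustration under stated assumptions, not the simulation code used in the study: the function names, the scaling constant of 1.7, and the fixed random seed are all assumptions.

```python
import math
import random

D = 1.7  # assumed logistic scaling constant


def p3pl(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def simulate_responses(items, n_examinees, seed=0):
    """Generate scored (0/1) responses for examinees drawn from N(0, 1).

    items: list of (a, b, c) true parameters, one tuple per item.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n_examinees):
        theta = rng.gauss(0.0, 1.0)  # sample a true ability
        row = [1 if rng.random() < p3pl(theta, a, b, c) else 0
               for a, b, c in items]
        data.append(row)
    return data
```

Each simulated response is scored correct when a uniform random draw falls below the model probability, so the expected proportion correct for an item equals its average response probability over the ability distribution.
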

Calibration  programs.  Although  the  objective  of  parameter  estimation  is  to 
obtain  estimates  that  are  identical  to  the  true  values,  in  practice  this  never 
happens.  Thus,  to  provide  a  basis  for  comparison,  the  current  version  of 
LOGIST  (LOGIST5;  Wingersky,  et  al.,  1982)  was  also  used  to  estimate 
parameters  from  these  data. 


Table 1

Means and Standard Deviations of the True Item Parameters for
Tests 1, 2, and 3

                 Test 1            Test 2            Test 3
Parameter     Mean     SD       Mean     SD       Mean     SD

a            1.492   0.410     1.492   0.410     1.041   0.417
b           -0.090   0.910    -0.179   1.820     0.675   1.069
c            0.230   0.082     0.230   0.082     0.198   0.028


Two  versions  of  ASCAL  were  used  to  estimate  parameters  on  these  data. 
Version  2.0  is  the  version  of  ASCAL  that  was  used  for  estimation  of  the 
parameters  of  the  item  pools  for  the  computerized  adaptive  version  of  the 
Armed  Services  Vocational  Aptitude  Battery  (ASVAB).  Since  that  calibration, 
several  numerical  improvements  were  made  to  ASCAL  resulting  in  the  current 
version, Version 3.0. Both versions were used here to investigate the magnitude
of the differences between the two versions.

Evaluative  criteria.  Three  different  criteria  were  used  to  evaluate  the 
accuracy  of  the  item  calibration  programs.  The  root  mean  squared  error 
(RMSE,  the  square  root  of  the  average  squared  difference  between  the  true  and 
estimated  parameters)  was  computed  for  each  of  the  three  (a,  b,  and  c) 
parameters  in  each  of  the  three  tests.  Similarly,  the  Pearson  product-moment 
correlation  coefficient  between  the  estimated  and  the  true  parameters  was 
computed  for  each  parameter  in  each  test.  The  third  criterion  was  a  calibration 
efficiency  criterion. 
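
The first two criteria are straightforward to compute. The sketch below uses hypothetical function names and takes two equal-length lists, the true and the estimated values of one parameter across the items of a test.

```python
import math


def rmse(true_values, estimates):
    """Square root of the average squared difference between true and estimated values."""
    n = len(true_values)
    return math.sqrt(sum((e - t) ** 2 for t, e in zip(true_values, estimates)) / n)


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)
```

The two criteria are complementary: a constant bias in the estimates raises the RMSE but leaves the correlation unchanged.
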

The  efficiency  criterion  is  a  relative-information  criterion  suggested  by 
Vale,  Maurelli,  Gialluca,  Weiss,  and  Ree  (1981).  Their  procedure  computes  the 
amounts  of  psychometric  information  (Birnbaum,  1968)  that  would  be  extracted 
from  the  items  if  they  were  scored  using  the  estimated  or  errant  parameters; 
the  relative  efficiency  of  the  estimated  parameters  (i.e.,  the  ratio  of  the 
information  in  the  estimated  parameters  to  the  information  in  the  true 
parameters)  can  be  determined  for  each  theta  level  by  comparing  the  errant 
information  with  the  true  information.  This  process  is  described  below. 
where P_g = the observed proportion of correct responses to
item g in r repetitions.

This is simply Equation 3 with P_g substituted for u_g. If the three-parameter
model holds and true parameters are available, P_g can be computed
using Equation 1. When this P_g is substituted into Equation 7 along with true
item parameters, the root of the equation is found at θ̂ = θ.

In a simulation, we can control when the true parameters are available and
when estimates must be used. Let us view P_g as the probability with which an
examinee will respond to item g with a correct answer. The probability of an
examinee's response being correct is governed by his or her true ability and the
true item parameters. Thus, P_g should be computed using θ and a_g, b_g, and c_g.
When we estimate ability (i.e., in a real-world environment), we must use the
estimated parameters. If the parameters are in error, our estimate (θ̂) will not
converge on θ but rather on Γ. The value of Γ corresponding to a given θ can
be determined by substituting the true P_g into Equation 7 and finding the root
using the errant parameters. Thus, using θ to denote the true ability; Γ to
denote the asymptotic value obtainable with errant parameters; a_g, b_g, and c_g
to denote the true parameters; and â_g, b̂_g, and ĉ_g to denote the estimated
parameters, we can rewrite Equation 7 as:
If the errors of calibration are zero or the estimated parameters are
consistent  with  the  true  parameters,  the  transformation  of  theta  to  gamma  will 
be  linear.  When  this  is  not  the  case,  as  in  almost  all  real  calibration  situations, 
the  transformation  will  be  nonlinear.  This  transformation  from  theta  to  gamma 
completely  describes  the  asymptotic  effect  of  item  parameter  error  on  ability 
estimation. 
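
The mapping from theta to gamma can be sketched numerically. The code below is an illustration under stated assumptions, not the procedure used in the report. The rewritten form of Equation 7 is not reproduced in this section, so the sketch assumes it is a weighted sum over items of the difference between the true probability at theta and the errant-parameter probability at gamma, set to zero, with locally best weights P'/(PQ) evaluated from the estimated parameters. The scaling constant D = 1.7, the bisection bracket, the function names, and the item values are likewise assumptions made for illustration.

```python
import math

D = 1.7  # assumed logistic scaling constant

def logistic(theta, a, b):
    """Two-parameter logistic function of theta."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def prob(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) * logistic(theta, a, b)

def weight(theta, a, b, c):
    """Assumed locally best weight P'/(PQ); for this model it reduces to D*a*L/P."""
    return D * a * logistic(theta, a, b) / prob(theta, a, b, c)

def theta_to_gamma(theta, true_items, est_items, lo=-10.0, hi=10.0, tol=1e-10):
    """Find gamma such that the assumed estimating equation equals zero.

    true_items and est_items are lists of (a, b, c) tuples in the same item order.
    """
    def f(gamma):
        return sum(
            weight(gamma, *est) * (prob(theta, *tru) - prob(gamma, *est))
            for tru, est in zip(true_items, est_items)
        )

    f_lo = f(lo)
    if f_lo * f(hi) > 0.0:
        raise ValueError("no sign change in the bracket; widen lo/hi")
    # Plain bisection: keep the half-interval that still contains the sign change.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

# Hypothetical three-item test (values are not from the report).
true_items = [(1.0, -1.0, 0.20), (1.5, 0.0, 0.25), (0.8, 1.0, 0.20)]
# Estimates that differ from the true values only by a linear change of metric.
rescaled = [(a / 2.0, 2.0 * b + 1.0, c) for a, b, c in true_items]

print(theta_to_gamma(0.5, true_items, true_items))  # approximately 0.5
print(theta_to_gamma(0.5, true_items, rescaled))    # approximately 2 * 0.5 + 1
```

In the second call the estimated parameters are consistent with the true ones up to a change of metric, so the recovered gamma is a linear function of theta, as the text describes.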
Efficiency. The information at theta for a specific test score (or scoring
function), X, can be expressed as the ratio of (1) the squared derivative of the
expected value of the scoring function, to (2) the variance of the scoring
function at theta (Birnbaum, 1968, p. 453):

$$I(\theta) = I(\theta; X) = \frac{\left[ \partial E(X \mid \theta) / \partial\theta \right]^2}{\mathrm{Var}(X \mid \theta)}$$

When the score is a linear combination of 0-1 item responses, the components of
the  information  equation  can  be  written  as: 

$$\frac{\partial E(X \mid \theta)}{\partial\theta} = \sum_g w_g P'_g(\theta)$$

and

$$\mathrm{Var}(X \mid \theta) = \sum_g w_g^2 \, P_g(\theta) \, [1 - P_g(\theta)]$$

where w_g = w_g(θ) is defined as in Equation 4; and

$$P'_g(\theta) = (1 - c_g) \, D a_g \, \psi[D a_g(\theta - b_g)].$$


The  information  available  from  scoring  response  vectors  using  errant 
parameters  can  be  viewed  as  equivalent  to  the  information  available  in  a  linear 
combination  of  item  responses  using  those  weights  determined  to  be  locally  best 
at Γ, using the estimates of the item parameters. Thus, substituting Equations 1
and  13  and  the  errant  weights  from  Equation  9  into  Equation  10,  the 
information  available  from  the  errant  parameters  is  given  by  Equation  14. 
Equation 14 represents the information contained in the errant parameters
as a function of θ. To produce a single-quantity estimate of the information
available,  this  function  may  be  jointly  integrated  with  a  standard  normal 
density  function  (i.e.,  numerically  integrated). 

To provide a relative efficiency index, the information thus obtained may
be compared to information available from the true parameters. This information
may  be  computed  in  the  same  manner  using  true  parameters  throughout,  or  it 
may  be  computed  using  any  of  the  formulas  provided  by  Birnbaum  (1968). 

Efficiencies  for  this  study  were  computed  in  the  manner  described  above. 
Efficiency,  as  reported  herein,  refers  to  the  ratio  of  errant  to  true  information. 
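
The efficiency computation can be sketched end to end. This is an illustration under assumptions, not the code used for this study. It assumes the information in a weighted score is the squared sum of w_g P'_g(θ) divided by the sum of w_g² P_g(θ) Q_g(θ); that the errant weights are the locally best weights P'/(PQ) computed from the estimated parameters at gamma; that gamma is found from the assumed estimating equation by bisection; that D = 1.7; and that integration against the standard normal density can be approximated by a midpoint rule on the interval from -4 to 4. All names and item values are hypothetical.

```python
import math

D = 1.7  # assumed logistic scaling constant

def logistic(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def prob(theta, a, b, c):
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) * logistic(theta, a, b)

def slope(theta, a, b, c):
    """First derivative of the response function with respect to theta."""
    L = logistic(theta, a, b)
    return (1.0 - c) * D * a * L * (1.0 - L)

def weight(theta, a, b, c):
    """Assumed locally best weight P'/(PQ), which reduces to D*a*L/P."""
    return D * a * logistic(theta, a, b) / prob(theta, a, b, c)

def theta_to_gamma(theta, true_items, est_items, lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection root of the assumed estimating equation in gamma."""
    def f(gamma):
        return sum(weight(gamma, *est) * (prob(theta, *tru) - prob(gamma, *est))
                   for tru, est in zip(true_items, est_items))
    f_lo = f(lo)
    if f_lo * f(hi) > 0.0:
        raise ValueError("no sign change in the bracket; widen lo/hi")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0.0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)

def information(theta, true_items, weights):
    """Information at theta in a weighted sum of 0-1 item responses."""
    num = sum(w * slope(theta, *it) for w, it in zip(weights, true_items))
    den = sum(w * w * prob(theta, *it) * (1.0 - prob(theta, *it))
              for w, it in zip(weights, true_items))
    return num * num / den

def efficiency(true_items, est_items, lo=-4.0, hi=4.0, n=81):
    """Ratio of normal-weighted errant information to true information."""
    step = (hi - lo) / n
    errant_total = 0.0
    true_total = 0.0
    for i in range(n):
        theta = lo + (i + 0.5) * step  # midpoint of the i-th subinterval
        density = math.exp(-0.5 * theta * theta) / math.sqrt(2.0 * math.pi)
        gamma = theta_to_gamma(theta, true_items, est_items)
        w_err = [weight(gamma, *est) for est in est_items]
        w_true = [weight(theta, *tru) for tru in true_items]
        errant_total += information(theta, true_items, w_err) * density * step
        true_total += information(theta, true_items, w_true) * density * step
    return errant_total / true_total

# Hypothetical true and estimated (a, b, c) values for three items.
true_items = [(1.0, -1.0, 0.20), (1.5, 0.0, 0.25), (0.8, 1.0, 0.20)]
est_items = [(1.2, -0.9, 0.15), (1.3, 0.1, 0.25), (0.9, 1.2, 0.18)]

print(efficiency(true_items, true_items))  # approximately 1
print(efficiency(true_items, est_items))   # at most 1
```

Because the information ratio is unchanged when every weight is multiplied by the same constant, the index penalizes only errors that change the relative weighting of items, not a uniform rescaling of the metric.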

Results 

RMSE.  Table  2  presents  the  root  mean  squared  error  between  the  true  and 
estimated  parameters  for  Tests  1,  2,  and  3.  For  each  of  the  tests  and  each  of 
the  calibration  programs,  the  c  parameter  had  a  smaller  RMSE  than  did  the 
other  two  parameters.  The  largest  RMSEs  in  each  calibration  run  were 
observed  for  the  a  parameter  for  Tests  1  and  2,  and  for  the  b  parameter  for 
Test  3. 

There were only minor differences in parameter estimation error observed
between the two versions of ASCAL. The RMSE for estimating the a parameter
dropped from 0.150 to 0.132 (from Version 2.0 to Version 3.0) for Test 1, and
increased from 0.378 to 0.386 for Test 2. In all other cases, the RMSEs were
essentially the same for both versions.

On  the  other  hand,  there  were  consistent  differences  in  the  RMSEs  between 
ASCAL and LOGIST. In nearly every case, the RMSEs for the parameters
produced by LOGIST were larger than those produced by either version of
ASCAL;  for  the  Test  1  b  parameters,  for  example,  the  RMSE  for  LOGIST 
estimates  was  more  than  twice  as  large  as  the  RMSE  from  ASCAL.  The  single 
exception  to  this  occurred  for  the  a  parameter  in  Test  3.  Here,  the  LOGIST 
RMSE  was  equal  to  0.150;  ASCAL  Versions  2.0  and  3.0  both  produced 
parameters  with  an  RMSE  of  0.161. 

Table 2

Root Mean Squared Error Between True and Estimated Item
Parameters from Calibration Programs ASCAL 2.0, ASCAL 3.0,
and LOGIST5 for Tests 1, 2, and 3

Test      Parameter    ASCAL 2.0    ASCAL 3.0    LOGIST5

Test 1        a          0.150        0.132       0.231
              b          0.088        0.088       0.182
              c          0.049        0.049       0.080

Test 2        a          0.378        0.386       0.497
              b          0.190        0.189       0.250
              c          0.079        0.079       0.090

Test 3        a          0.161        0.161       0.150
              b          0.176        0.176       0.183
              c          0.054        0.055       0.066


Correlations.  The  product-moment  correlations  between  the  true  and 
estimated  parameters  are  presented  in  Table  3  for  each  test  and  each 
calibration  program.  These  correlations  ranged  from  .596  to  .952  for  a,  from 
.986 to .996 for b, and from .377 to .828 for c.

Few  differences  in  the  correlations  were  observed  between  the  two  versions 
of  ASCAL.  The  correlations  improved  for  the  a  parameters  in  Test  1  (from 
.935  to  .952)  and  for  the  c  parameters  in  Test  2  (from  .613  to  .620);  in  all  other 
instances,  the  results  from  the  two  versions  of  ASCAL  were  essentially 
identical. 

The  correlations  between  the  true  and  estimated  b  and  c  parameters  were 
lower  for  LOGIST  than  for  ASCAL  across  most  comparisons  of  the  three  tests; 
these  differences  were  pronounced  for  the  c  parameters  and  small  for  the  b 
parameters.  However,  the  differences  in  the  correlations  for  the  a  parameters 
were  not  consistent  between  the  two  types  of  calibration  programs.  For  Test  1, 
the  correlations  for  the  ASCAL-produced  a  parameters  were  .935  and  .952;  for 
LOGIST,  this  correlation  was  .848.  For  Test  3,  the  a  parameters  from  LOGIST 
correlated  more  highly  with  their  true  values  than  did  the  a  parameters  from 
ASCAL  (.934  versus  .926  for  both  versions  of  ASCAL).  There  were  essentially 
no  differences  in  the  a  parameter  correlations  for  Test  2. 


Table 3

Product-Moment Correlations Between True and Estimated
Parameters from Calibration Programs ASCAL 2.0, ASCAL 3.0,
and LOGIST5 for Tests 1, 2, and 3

Test      Parameter    ASCAL 2.0    ASCAL 3.0    LOGIST5

Test 1        a          0.935        0.952       0.848
              b          0.996        0.996       0.986
              c          0.826        0.828       0.612

Test 2        a          0.598        0.598       0.596
              b          0.995        0.995       0.991
              c          0.613        0.620       0.498

Test 3        a          0.926        0.926       0.934
              b          0.993        0.993       0.990
              c          0.463        0.463       0.377


Efficiency.  Table  4  presents  the  efficiency  statistics  computed  for  the  three 
calibration  programs  and  the  three  tests.  Calibration  efficiency  ranged  from 
.971  to  .990  in  this  table  and  was  generally  lowest  for  Test  2  for  both  LOGIST 
and  ASCAL.  There  were  essentially  no  differences  in  calibration  efficiency 
between  the  two  versions  of  ASCAL;  these  values  were  .990  versus  .990  for  Test 
1,  .977  versus  .975  for  Test  2,  and  .985  versus  .984  for  Test  3. 


Table 4

Efficiency Statistics from Calibration Programs ASCAL 2.0,
ASCAL 3.0, and LOGIST5 for Tests 1, 2, and 3

Test      ASCAL 2.0    ASCAL 3.0    LOGIST5

Test 1      0.990        0.990       0.981
Test 2      0.977        0.975       0.971
Test 3      0.985        0.984       0.986


The  calibration  efficiency  for  LOGIST  was  lower  than  that  for  ASCAL  for 
Test  1  (.981  versus  .990  for  both  ASCAL  versions)  and  slightly  lower  for  Test  2 
(.971  versus  .977  and  .975).  LOGIST’s  efficiency  was  slightly  higher  for  Test  3 
(.986  versus  .985  and  .984). 

Summary.  There  were  only  minor  (and  inconsequential)  differences  in  the 
three  evaluative  criteria  between  the  two  versions  of  ASCAL.  The  differences 
between  ASCAL  and  LOGIST  were  slightly  larger  than  the  differences  between 
the  two  versions  of  ASCAL;  however,  even  these  differences  were  small. 

The  only  difference  between  Test  1  and  Test  2  in  this  study  was  in  the 
range  of  the  b  parameters  (i.e.,  the  Test  1  difficulty  parameters  were  multiplied 
by  2.0  to  produce  the  difficulty  parameters  in  Test  2).  This  had  a  dramatic 
effect  on  item  calibration  for  both  ASCAL  and  LOGIST:  (1)  for  all  three  item 
parameters,  RMSE  was  substantially  larger  in  Test  2,  (2)  the  correlations 
between  the  true  and  estimated  a  and  c  parameters  were  substantially  smaller 
in  Test  2,  and  (3)  calibration  efficiency  was  lower  for  Test  2.  Only  the  b 
parameter  correlations  were  unaffected  by  this  manipulation. 


DISCUSSION 

The  results  of  this  evaluation  suggest  that  ASCAL  produces  parameter 
estimates  that  are  at  least  as  accurate  as  those  produced  by  LOGIST.  Thus, 
although  different  assumptions  and  different  procedures  are  used  in  estimating 
the  parameters,  the  parameters  obtained  from  either  program  should  be 
considered  of  equivalent  quality. 

However,  considerations  other  than  accuracy  must  also  be  weighed  when 
choosing  between  LOGIST  and  ASCAL.  Convenience  and  cost  would  dictate 
that  ASCAL  be  chosen  because  it  is  very  easy  to  use  and  runs  economically  on 
a  personal  computer.  On  the  other  hand,  ASCAL  currently  has  a  limit  of  100 
items  and  5,000  examinees;  LOGIST’s  corresponding  limits  are  somewhat  larger. 
Furthermore,  since  ASCAL  is  run  with  virtually  no  user-specified  options,  it 
lacks  the  flexibility  of  LOGIST.  Finally,  the  sole  purpose  of  ASCAL  is  to 
estimate  item  parameters.  It  has  no  data  editing  or  transformation  capability 
and  it  does  not  produce  scores  or  ability  estimates  for  the  examinees.  If  any  of 
these  features  are  required,  LOGIST  may  be  a  better  choice  for  item 
calibration.  Otherwise,  the  economy  and  simplicity  of  ASCAL  appear  to  make 
it  a  better  choice. 